3. Analysis of Research Resources using RRIDs


Overview

This section presents the analysis of research resources using the Research Resource Identifiers (RRIDs).

Before directly diving into the actual analysis, first exploratory data analysis has been done to understand the structure of data, file format, number of journals, number of publications, etc. This step was found to be important to optimize the code and for creating parallel processes for RRID extraction in the second step to speed up the RRID extraction.

In the second step, different factors have been taken into account to formulate a regular expression (regex) formula. Then the regex formula is used to extract RRIDs keys from 3.2 million publications ( available in XML format). The RRID extraction has been done using multi-processing with 32 parallel processes. The intermediate result of the RRID extraction is then stored in a nested dictionary as shown below:

Then the dictionary file is saved as a JSON file. Further information about the publications like electronic publication date epub, article title, etc. has also been retrieved from the XML files and saved as a CSV file for analysis. After that, RRID analysis and visualization have been done using the extracted RRIDs ( in JSON and CSV files ). Finally, facts of the data analysis have been summarized and some recommendation is also made for future work.  

The sequence of steps used in this project is shown below:

Note:

  • The analysis in this project is made in a procedural manner so that results shown in this work are reproducible.

  • The publications are available at /mnt/nvs3/nlp-public/Pubmed/pubmed_xml/ for elaine Server.


3.1. Exploratory Analysis of PubMed Publications


Exploring Data, often known as Exploratory Data Analysis (EDA), is an important step at the initial phase of data analysis. EDA helps to understand the structure of the data, gives a glimpse into the main features of the data such as data format, structure, etc. [37]

In [1]:
# import essential libraries
import os
from os import listdir
import simplejson as sj
import json
import xml.etree.ElementTree as ET
import concurrent.futures
import time
import re
import operator
import pandas as pd

How many Journals do we have in our data set?

In [2]:
from os import listdir

#Create list of pmed_directories

pmed_directory_list = listdir("/mnt/storage1/nlp/pubmed_xml/")
pmed_directory_list.sort()
print(f"We have {len(pmed_directory_list)} Journals in the PubMed.\n")

# sample PubMed Journal directories
print ("Sample Journal names:")
pmed_directory_list[:5]
We have 15259 Journals in the PubMed.

Sample Journal names:
Out[2]:
['100_Tips_to_Avoid_Mistakes_in_Academic_Writing_and_Presenting',
 '130_Years_of_Medicine_in_Hong_Kong',
 '13th_International_Conference_on_Theory_and_Application_of_Fuzzy_Systems_and_Soft_Computing_—_ICAFS-2018',
 '2011_International_Conference_in_Electrics,_Communication_and_Automatic_Control_Proceedings',
 '20_Century_Br_Hist']

What is the file format of publications?

In [3]:
import os

# List all files in a directory 
xml_samle_file_path = "/mnt/storage1/nlp/pubmed_xml/Zygote"
with os.scandir(xml_samle_file_path) as entries:
    for entry in entries:
        if entry.is_file():
            print(f'file format: {entry.name}')
file format: PMC4413872.nxml
file format: PMC4501373.nxml
file format: PMC4416383.nxml
file format: PMC4697306.nxml
file format: PMC4413873.nxml

How many publications are there in total?

In [4]:
xml_list = []
path ="/mnt/storage1/nlp/pubmed_xml/"

for dir_ in pmed_directory_list[:]:

    xml_path = path+dir_         #creates path to the directory that contains xml files
    if os.path.isdir(xml_path):  # check if a directory exists
        for xml_file in os.listdir(xml_path):
            if not xml_file.endswith('.nxml'): continue  # skip non xml files
            xml_list.append(xml_file)

print(f"Total Number of publications: {len(xml_list)}")
Total Number of publications: 3227590

What are the biggest journals by number of publications?

To find the biggest journals, that have the largest number of publications:

  • Map each XML publication with its parent journal using the mapper_function defined below.
  • Count the number of publications in each Journal.
  • Sort the result in descending order and take the top 10 results.
  • Visualize the result using a bar graph.
In [5]:
# define a function that mapps each publication to its Journal 

def mapper_function(dir_):
    path = "/mnt/storage1/nlp/pubmed_xml/"
    map_dict = {}

    #create a path to the directory 
    xml_path = path+dir_

    xml_list = []
    if os.path.isdir(xml_path):  # check if a directory exists
        for xml_file in os.listdir(xml_path):

            # skip if the file is not xml
            if not xml_file.endswith('.nxml'): continue

            # crate a path to the xml files
            xml_file_path = os.path.join(xml_path, xml_file)

            # list of publications in a journal
            xml_list.append(xml_file)

            # store in the dictionary
            map_dict[dir_] = xml_list

    return map_dict
In [6]:
# map each xml publication to its parent Journal using ProcessPoolExecutor

import concurrent.futures

Journal_xml_mapping_dict = {}
with concurrent.futures.ProcessPoolExecutor() as executor:

    # creates 32 processes 
    for i in range(0,len(pmed_directory_list),32):

        for dir_rk in pmed_directory_list[i:i+32]:
            result_list = []
            results = executor.submit(mapper_function, dir_rk)   # returns directory-to-publication mapping
            result_list.append(results)                          # stores the processes reults in the list 

            for f in concurrent.futures.as_completed(result_list):
                f.result()
                Journal_xml_mapping_dict.update(f.result())  # save the Journal_xml_mapping_dict as json file

Counting Number of publications in each journal - journal size

In [7]:
# save the number of publications in each Journal as journal_size_dict 

journal_size_dict = {}

for journal in list(Journal_xml_mapping_dict):
    journal_size_dict[journal] = len(Journal_xml_mapping_dict.get(journal))   # save the journal_size_dict as json file



# sort journals in a descending order by publication count
Journal_size_desceding_order = dict(sorted(journal_size_dict.items(), key=lambda item: item[1], reverse=True)[:])


# Take the top 10 JOurnals by publication  
Top_10_RRID_Journals_desc= dict(sorted(Journal_size_desceding_order.items(), key=operator.itemgetter(1), reverse=True)[:10])
In [8]:
# Visualization of top 10 journals by RRID count

import matplotlib.pyplot as plt

# plot color style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14,7))

# plot bars
ax.bar(list(Top_10_RRID_Journals_desc.keys()), list(Top_10_RRID_Journals_desc.values()))

# annotate the plot
ax.set_ylabel("Number of Publications")
ax.set_xlabel("Journals")
ax.set_title("Top 10 Journals by no. of publications")
ax.set_xticklabels(list(Top_10_RRID_Journals_desc.keys()), rotation=15)

fig.savefig("Plot: 1.Top 10 Journals by no. of publications.png", dpi = 130)
plt.show()

 Comparison of Journals' size using a Bubble chart

Using the result found above, journals with publication count in them journal_size_dict:

  • Plot a bubble graph to compare the size of journals
  • Place each bubble in a randomly generated (X, Y) co-ordinate in a 2D-plane.
  • The largest bubbles correspond to the journals with a large number of publications.

Note: Only journals with more than 2,000 publications have been visualized below.

In [9]:
import pandas as pd
import numpy as np

np.random.seed(42)
N = 15167

# generate random x and y to place the bubble in the x-y plane

x = np.random.normal(170, 20, N)
y = x + np.random.normal(5, 25, N)
colors = np.random.rand(N)
Journals = np.array(list(journal_size_dict.keys()))
count = np.array(list(journal_size_dict.values()))

# create a data frame

df = pd.DataFrame({'X': x,'Y': y,'colors':colors,'Journals': Journals,"Number of Publications":count})

# subset journals with more than 2k publications
df_2k = df[df['Number of Publications'] > 2000 ]
df_2k

#  visualize the result
import plotly.express as px
import plotly.graph_objects as go

fig = px.scatter(df_2k, x="X", y="Y",size="Number of Publications",hover_name="Journals",log_x=True, size_max=60)

fig.show()

3.2. RRID Extraction from  publications using Regular Expressions


In this project Research Resource IDentifiers (RRIDs) have been extracted using regular expressions (RegX).

3.2.1. Why regular expressions for RRID extraction?

Regular expressions are very powerful to extract information from text files particularly when text information has a specific pattern that can be generalized. Since RRID keys have a unique syntax of the form RRID:prefix_Identifier, regular expressions can be used to effectively retrieve all RRID citations from publications as long as the RegX formula can handle all variants of RRID resource citations.  

3.2.2. Specification of RegX search pattern

To capture as many RRIDs as possible, our RegX formula needs to be robust and must be able to generalize to all forms and variations of RRID citations.

Even though RRIDs have a specific syntax which we can specify using RegX right away, it is required to consider how authors might use RRIDs. For instance, in a sample publication (can be found here), the author intentionally ignored the RRID part when citing the resource as shown on the figure below:

On the other hand, we have some research resources where the prefix is composed of a set of strings separated by a special character. For instance: RRID:IMSR_JAX:000664 has slightly a different prefix pattern compared to RRID:AB_10564097.

Therefore, to effectively retrieve all RRID citations in the publications, the two factors have been considered in this work as follows:

    I.  Inconsistencies of RRID use by authors and

    II. Variants of RRID prefixes. 

I. Inconsistencies of RRID use by authors

Three types of inconsistencies that occur when authors use RRIDs in their citations are the inconsistency of Case, Space and Ignored part from the RRID-Syntax.

Case Inconsistency: upper Case and Lower Case - to handle latter case-related inconsistencies, the case ignore flag has been used with the regex.

Space Inconsistency: The most common space inconsistency observed is one-space after RRID:. Some authors included one-space after RRID:. For example: In a sample paper (can be found here), the author added space in 3 instances of RRID citations and no space in one instance as shown below:

Ignored Part from the Syntax

Usually, authors included the RRID-Part and only in fewer cases authors ignored the RRID: Part. For this reason, the regular expression search pattern is formulated to handle both cases (with/without RRID-Part). More elaborate analysis for Ignored/Included  RRID part has been presented in the RRID analysis part.

II. Varients of RRID prefixes

Some RRID patterns are slightly different from most RRID patterns by their RRID Prefix orientation. For example: RRID:IMSR_JAX:033255 vs RRID:AB_10564097. To accommodate this factor into our regular expression search pattern,  a list of RRID prefixes [31] has been used to capture all variants of RRIDs from the scientific publications. 

The RegX formula

After Considering all the above factors (all the inconsistencies and varients of RRIDs), a RegX search pattern for RRIDs has been formulated for all variants of RRIDs as follows:

RRID: ?SCR_[0-9]{4,}|(?!RRID:) ?SCR_[0-9]{4,}|RRID: ?OMICS_[0-9]{4,}|(?!RRID:) ?OMICS_[0-9]{4,}|RRID: ?AB_[0-9]{4,}|(?!RRID:) ?AB_[0-9]{4,}|RRID: ?CVCL_[0-9]{4,}|(?!RRID:) ?CVCL_[0-9]{4,}|RRID: ?CVCL_[A-Z]+[0-9]+|(?!RRID:) ?CVCL_[A-Z]+[0-9]+|RRID: ?BDSC_[0-9]{4,}|(?!RRID:) ?BDSC_[0-9]{4,}|RRID: ?RGD_[0-9]{4,}|(?!RRID:) ?RGD_[0-9]{4,}|RRID: ?IMSR_JAX:[0-9]{6,}|(?!RRID:) ?IMSR_JAX:[0-9]{6,}|RRID: ?Addgene_[0-9]{4,}|(?!RRID:) ?Addgene_[0-9]{4,}|RRID: ?DGGR_[0-9]{4,}|(?!RRID:) ?DGGR_[0-9]{4,}|RRID: ?EXRC_[0-9]{4,}|(?!RRID:) ?EXRC_[0-9]{4,}|RRID: ?NSRRC_[0-9]{4,}|(?!RRID:) ?NSRRC_[0-9]{4,}|RRID: ?MGI_[0-9]{4,}|(?!RRID:) ?MGI_[0-9]{4,}

Note: The above regX search pattern is a truncated version after removing RRIDs which are not found on the Sci-Crunch.

3.2.3. RRID Extraction

Using the RegX formula specified above, extraction of RRIDs from the XML publications has been done following the steps below:

  1. Parsing of XML publications - using ElementTree
  2. Converting the XML file to String
  3. Searching for RRID keys using the regular expression formula specified above.
  4. Storing the search result in the dictionary and saving the dictionary as JSON file for later analysis.

RRID_extractor function

The RRID_extractor function is an implementation of the above pipeline. To speed up the RRID extraction process, multiple calls have been made to this function using a process pool executor.  

In [10]:
# RRID_extractor function -- extracts RRIDs from each Journal

def RRID_extractor(dir_):
    import re
    import xml.etree.ElementTree as ET
    path = '/mnt/storage1/nlp/pubmed_xml/'

    # regular expression formula 
    all_rrid_search = '''RRID: ?SCR_[0-9]{4,}|(?!RRID:) ?SCR_[0-9]{4,}|RRID: ?OMICS_[0-9]{4,}|(?!RRID:) ?OMICS_[0-9]{4,}|RRID: ?AB_[0-9]{4,}|(?!RRID:) ?AB_[0-9]{4,}|RRID: ?CVCL_[0-9]{4,}|(?!RRID:) ?CVCL_[0-9]{4,}|RRID: ?CVCL_[A-Z]+[0-9]+|(?!RRID:) ?CVCL_[A-Z]+[0-9]+|RRID: ?BDSC_[0-9]{4,}|(?!RRID:) ?BDSC_[0-9]{4,}|RRID: ?RGD_[0-9]{4,}|(?!RRID:) ?RGD_[0-9]{4,}|RRID: ?IMSR_JAX:[0-9]{6,}|(?!RRID:) ?IMSR_JAX:[0-9]{6,}|RRID: ?Addgene_[0-9]{4,}|(?!RRID:) ?Addgene_[0-9]{4,}|RRID: ?DGGR_[0-9]{4,}|(?!RRID:) ?DGGR_[0-9]{4,}|RRID: ?EXRC_[0-9]{4,}|(?!RRID:) ?EXRC_[0-9]{4,}|RRID: ?NSRRC_[0-9]{4,}|(?!RRID:) ?NSRRC_[0-9]{4,}|RRID: ?MGI_[0-9]{4,}|(?!RRID:) ?MGI_[0-9]{4,}'''

    # comile the regx formula
    PatternW= re.compile(all_rrid_search, re.IGNORECASE)

    xml_rrid_dict = {}      # stores a mapping of xml publications to the list pf RRID key citations
    directory_xml_dict = {} # stores  a mapping of journal to xml_rrid_dict dictionary

    xml_path = path+dir_  # path to each journal
    if os.path.isdir(xml_path):  # check if a directory exists
        for xml_file in os.listdir(xml_path):
            if not xml_file.endswith('.nxml'): continue  # check is the file is xml
            xml_file_path = os.path.join(xml_path, xml_file)  # create path to the xml file

            #parse the xml 
            tree = ET.parse(xml_file_path)
            root = tree.getroot()

            #convert xml to a string
            string_root = ET.tostring(root, encoding='utf8').decode('utf8')

            #serach for rrid in the string
            match = re.findall(PatternW, string_root)

            if len(match) != 0: # if RRID is found in the publication , store the result in the dictionaries
                match_list = []
                for mat in match:
                    match_list.append(mat)

                    # save the rrid result in the dictionaries
                    xml_rrid_dict[xml_file] = match_list
                    directory_xml_dict[dir_] = xml_rrid_dict

            else:           # if RRID is fNOT found ,do nothing
                pass

    return directory_xml_dict
In [11]:
#  mapper_func function -- maps each XML publication to its parent Journal. 

def mapper_func(dir_):
    path = "/mnt/storage1/nlp/pubmed_xml/"
    map_dict = {}
    xml_path = path+dir_         #creates path to the directory
    xml_list = []
    if os.path.isdir(xml_path):  # check if a directory exists
        for xml_file in os.listdir(xml_path):
            if not xml_file.endswith('.nxml'): continue       # check if the file is XML
            xml_file_path = os.path.join(xml_path, xml_file)  # creates path to the xml file

            xml_list.append(xml_file)
            map_dict[dir_] = xml_list                          # Journal xml mapping
    return map_dict

Extracting RRID keys from  XML publications

Using the ProcessPoolExecutor 32 parallel processes have been created by making multiple calls to the RRID_extractor function. Each process extracts RRID keys from one journal at a time. 

In [12]:
# using 32 parallel processes to extract RRIDs 

from os import listdir
import concurrent.futures
import operator
import os

# create a dictionary that stores the RRID extraction result 
result_dict_2 = {}

#Create list of directories
pmed_directory_list = listdir("/mnt/storage1/nlp/pubmed_xml/")

with concurrent.futures.ProcessPoolExecutor() as executor:
    for i in range(0,len(pmed_directory_list[:32]),32):  # the first 32 Journals -> pmed_directory_list[:32]

        result_list = []
        for dir_rk in pmed_directory_list[i:i+32]:  # Create 32 processes 

            results = executor.submit(RRID_extractor, dir_rk)   # returns journal_publication_rrid dictionary
            result_list.append(results)

        for f in concurrent.futures.as_completed(result_list):
            for key_ in list(f.result().keys()):
                result_dict_2[key_] =f.result().get(key_)

print("RRID Search result in =>:", result_dict_2)
RRID Search result in =>: {'Cell_Death_Discov': {'PMC5856839.nxml': ['RRID:AB_10063408', 'RRID:AB_331460', 'RRID:AB_1846356', 'RRID:AB_2341188', 'RRID:AB_477010', 'RRID:AB_399621', 'RRID:AB_2298772', 'RRID:AB_1846356', 'RRID:AB_331426', 'RRID:AB_262011'], 'PMC7575537.nxml': [' AB_10015289', ' AB_2313567'], 'PMC6155004.nxml': [' AB_2147781'], 'PMC7028721.nxml': ['RRID: SCR_016391', 'RRID: SCR_008452', 'RRID: SCR_002798'], 'PMC6768854.nxml': ['RRID: Addgene_10878']}}

Extracted RRID results from the above cell have been saved in a JSON file _3.Journal_xml_RRID_final_Result.json.

Note: For the purpose of demonstration, the above cell runs over a small sample of only the first 32 PubMed Journals range(0,len(pmed_directory_list[:32]),32).

It would take 3-4 days if the cell runs overall 15k journals range(0,len(pmed_directory_list[:]),32). This is due to a large number of publications, longer regX search patterns, internet speed, etc.


3.3. RRID Analysis


In this part analysis of RRIDs has been done.

3.3.1. How many RRID citations in all Journals?

In [13]:
# read previously saved file from RRID extraction

with open("/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_3.Journal_xml_RRID_final_Result.json", "r") as read_file:
    Journal_xml_RRID_final_Result = json.load(read_file)

# create a list of XML files that has RRID citation   

xml_with_RRID_list = []
for journal in list(Journal_xml_RRID_final_Result.keys()):

    for xml in list(Journal_xml_RRID_final_Result.get(journal).keys()):
        xml_with_RRID_list.append(xml)

# Create a list of all RRID keys 

all_rrid_list = []
for journal in list(Journal_xml_RRID_final_Result):
    for rrid_list in list(Journal_xml_RRID_final_Result.get(journal).values()):
        for rrid_key in rrid_list:
            all_rrid_list.append(rrid_key)

print(f"{len(all_rrid_list)} RRID citations in {len(xml_with_RRID_list)} publications, in {len(Journal_xml_RRID_final_Result)} Journals.")


# How many unique RRID citations

import numpy as np

# list out unique RRID keys from all_rrid_list 
resource_name, resource_counts = np.unique(all_rrid_list, return_counts=True)
Resource_count_dict = dict(zip(resource_name, resource_counts))

unique_rrid_list = list(Resource_count_dict.keys())

print(f"Out of the {len(all_rrid_list)} RRIDs, {len(unique_rrid_list)} RRID citations are unique. ")
152598 RRID citations in 10082 publications, in 660 Journals.
Out of the 152598 RRIDs, 45220 RRID citations are unique.

3.3.2. Web Scraping RRIDs on the Sci-crunch Resolver

What portion of unique RRID keys are found on the Sci-Crunch?

To check what portion 45,220 unique RRID keys are found on the Sci-Crunch, each RRID key is checked on the Sci-Crunch and the result has been retrieved using web-scraping. Some of the RRIDs are not found on the Sci-crunch returning 404 error as shown here.

In [14]:
import requests
import re
from bs4 import BeautifulSoup as soup

# Sci-Crunch Resolver address
resolver_url = "https://scicrunch.org/resolver/"

# create lists that store RRIDs found/not-found on the Sci-Crunch
rrids_found_on_resolver = []
rrids_not_found_on_resolver = []

# for each unique RRID key, web-scrape the result of the Sci-Crunch resolver

for rrid_ in unique_rrid_list:

    # create a link 
    resource_url = resolver_url+rrid_

    # retrrieve the text from the sci-crunch result page
    resource_page_text = requests.get(resource_url).text

    # parse the page
    page_soup = soup(resource_page_text, "html.parser")

    # for RRIDs found on the Sci-crunch html div tag has "id":"data_info" property
    info_container = page_soup.find("div", {"id":"data_info"})

    # for RRIDs NOT found on the Sci-crunch html div tag has class_="error-v3" property
    info_container2 = page_soup.find("div", class_="error-v3")

    # RRIDs found
    if info_container != None: # RRIDs found
        print(f"{rrid_} is found at {resource_url}")
        rrids_found_on_resolver.append(rrid_)

    # RRIDs not-found
    elif info_container2 != None:
        print(f"{rrid_} is NOT found at {resource_url}")
        rrids_not_found_on_resolver.append(rrid_)

    # exception
    else:
        # print the page response
        print(resource_page_text)
        pass

    break
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.16.1</center>
</body>
</html>

Note: I tried to do the web scrapping using 32 parallel processes, making 32 requests in parallel to the Sci-Crunch resolver. Since then, my server is probably blacklisted which is why the 403 Forbidden response. Therefore, I did the web scraping using my local machine, saved the results, and uploaded them to the Hamilton server.  

In [15]:
# read the results of the web_scraping

with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_5.found_rrid_list_on_SciCrunch.txt', 'r') as file:
    found_rrid_list =sj.load(file)

with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_6.missing_rrid_list_on_SciCrunch.txt', 'r') as file:
    missing_rrid_list =sj.load(file)

# calculate percentage of RRIDs found on the Sci-Crunch
Result = len(found_rrid_list)/len(unique_rrid_list)*100

print(f"From {len(unique_rrid_list)} unique RRIDs, {len(found_rrid_list)} are Found on the Sci-Crunch Resolver. \nThat is {Result} %")
From 45220 unique RRIDs, 42975 are Found on the Sci-Crunch Resolver.
That is 95.03538257408226 %

3.3.3.  How often do authors use RRIDs intentionally?

As stated before, on the regular expression specification section above, some authors did not include the RRID-Part. When a resource citation lacks the RRID-Part, it is difficult to know the intention of the author (whether or not the author is using research resource identifiers (RRIDs) or something else).

How many of RRIDs have RRID-Part?

In [16]:
# RRIDs that are found on Sci_Crunch

found_rrid_list_with_RRID_part = []     # stores a list of RRIDS that have RRID part like    RRID:Prefix_xxxxxx
found_rrid_list_with_noRRID_part = []   # stores a list of RRIDS that have NO RRID part like      Prefix_xxxxxx

for rrid_ in found_rrid_list:  # for RRIDs found on the sci-crunch

    if 'RRID' in rrid_:   # if there is RRID-Part store it into found_rrid_list_with_RRID_part list
        found_rrid_list_with_RRID_part.append(rrid_)

    elif 'RRID' not in rrid_: # if there is NO RRID-Part store it into found_rrid_list_with_noRRID_part list
        found_rrid_list_with_noRRID_part.append(rrid_)
    else:
        pass

# RRIDs that are NOT found on Sci_Crunch

missing_list_with_RRID_part = []   # stores a list of RRIDS that have RRID part like    RRID:Prefix_xxxxxx
missing_list_with_noRRID_part = [] # stores a list of RRIDS that have NO RRID part like      Prefix_xxxxxx
for rrid_ in missing_rrid_list:

    if 'RRID' in rrid_:  # if there is RRID-Part
        missing_list_with_RRID_part.append(rrid_)

    elif 'RRID' not in rrid_: # if there is NO RRID-Part
        missing_list_with_noRRID_part.append(rrid_)
    else:
        pass

print(f'From {len(found_rrid_list)} RRIDs found on the Sci-Crunch, {len(found_rrid_list_with_noRRID_part)} have "NO RRID part"  : {len(found_rrid_list_with_noRRID_part)/len(found_rrid_list)*100} %')
print(f'From {len(found_rrid_list)} RRIDs found on the Sci-Crunch,  {len(found_rrid_list_with_RRID_part)} have "RRID part", : {len(found_rrid_list_with_RRID_part)/len(found_rrid_list)*100} % \n')


print(f'From {len(missing_rrid_list)} RRIDs Not-found on the Sci-Crunch, {len(missing_list_with_noRRID_part)} have "NO RRID part"  : {len(missing_list_with_noRRID_part)/len(missing_rrid_list)*100} %')
print(f'From {len(missing_rrid_list)} RRIDs Not-found on the Sci-Crunch,  {len(missing_list_with_RRID_part)} have "RRID part", : {len(missing_list_with_RRID_part)/len(missing_rrid_list)*100} %')
From 42975 RRIDs found on the Sci-Crunch, 24354 have "NO RRID part"  : 56.670157068062835 %
From 42975 RRIDs found on the Sci-Crunch,  18621 have "RRID part", : 43.32984293193717 %

From 2456 RRIDs Not-found on the Sci-Crunch, 2062 have "NO RRID part"  : 83.95765472312704 %
From 2456 RRIDs Not-found on the Sci-Crunch,  394 have "RRID part", : 16.042345276872965 %

  A closer look to the  papers

The above result indicates that more than half of the authors (56%), did not include the RRID-Part. Is this true? What is the authors' intention?, were authors citing research resources using RRIDs or are they citing something else? Do RRID citations appear in the same way in XML files (used in this project) vs in the PDF files?

To find answers for those questions, a sample of actual papers have been investigated in 4-scenarios as follows:

i. RRIDs found on Sci-Crunch and  has "RRID-Part" :

In a sample of 8 publications (link provided) that contain RRIDs found on the Sci-Crunch with RRID-Part, typically the following things has been observed:

       
  • Authors' used RRIDs intentionally, (intended use).
  •    
  • The RRIDs appear in the same way in the XML file and in the PDF file. That is the "RRID-Part" appears following the `RRID:Prefix_XXXX` syntax.
  • A DOI link to the sample files is: 1, 2, 3, 4, 5, 6, 7, 8

  • A sample snapshot from the XML file and a PDF file from 1 is shown below:

ii. RRIDs found on Sci-Crunch and has NO "RRID-Part":

In a sample of 8 publications that contain RRIDs found on the Sci-Crunch but has NO RRID-Part typically the following things has been observed:  

       
  • Authors' used RRIDs intentionally, (intended use).
  •    
  • The RRIDs DOES NOT appear in the same way in the XML file and in the PDF file. In the  XML file, the RRID part is often placed as an attribute in XML tag, separated from the prefix_XXXX part.
  •    
  • The authors actually included the RRID-PART most of the time. This shows probably more than 43% of the RRIDs have RRID-Part.
  • Link to the sample files is 1 ,2, 3, 4, 5, 6, 7, 8

  • A sample snapshot from the XML file and a PDF file from 4 is shown below:

iii. RRIDs NOT found on Sci-Crunch and has "RRID-Part":

From a sample of 5 publications that contain RRIDs NOT found on the Sci-Crunch but have RRID-Part the following things has been observed:

       
  • Authors' used RRIDs intentionally, (intended use).
  •    
  • The resources are NOT found on the SCi-Crunch.
  •    
  • This happens usually when the author makes a mistake while citing the resource (typo or wrong prefix). See sample snapshots.
  • Links to the sample files are: 1 ,2, 3, 4, 5

  • Examples are shown on a sample snapshot from the XML file and a PDF file from 1  and 2 below:

iv. RRIDs NOT found on Sci-Crunch and has  NO "RRID-Part":

In a sample of 6 publications that contain resource citations which are NOT found on the Sci-Crunch and as well have NO RRID-Part typically the following things has been observed:

  • Authors' did NOT use RRIDs intentionally, (unintended use)
  • The resources are NOT found on the SCi-Crunch
  • The resources are most likely non-RRID resources (even non-resource at all).
  • Links to the sample files are: 1 ,2, 3, 4, 5 , 6

  • Examples are shown on a sample snapshot from the XML file and a PDF file from 3, 4, and 5 below:

From the above 4 cases, we can draw the following conclusions:

  • Most of the time authors have included the RRID-Part.

  • If RRID-Part is included and if the RRID resource is found on the Sci-Crunch, the author most likely intentionally used RRIDs.

  • If RRID-Part is included but if the RRID resource is not found on the Sci-Crunch, this means the author intentionally used RRIDs but probably made a mistake while citing the RRID key.

  • If the RRID-part is not included and if the resource is not found on the sci-crunch, this means the resource is most likely a Non-RRID resource and the author did not mean to use RRIDs.

Note: No cases have been found where the author used a Non-RRID resource which is found in the Sci-Crunch.

3.3.4. Types of research resources

Types of research resources refer to the category of a research resource as software/database, antibody, cell-line, plasmid, etc. A type of resource is specified by the prefix of its RRID key. From the list of 42,975 unique RRIDs, 16 unique types of resources have been found below :

In [17]:
#define a func that extracts a type of resource (RRID-prefix)

def unique_rrids(all_rrid_list):      # takes all lists of RRIDs extracted from all publications in 660 journals
    rrid_key_list = []

    for str_ in all_rrid_list:  # for each RRID in all_rrid_list
        if type(str_) == str:

            if 'IMSR_JAX'in str_ : # if 'IMSR_JAX' is in the RRID key
                key_ = str_.split(':', 1)[0].strip().upper()

            elif 'Addgene' in str_: #if 'Addgene' is in the RRID key
                key_ = str_.split('_', 1)[0].strip()
            else:
                key_ = str_.split('_', 1)[0].strip().upper()

                if key_ not in rrid_key_list:
                    rrid_key_list.append(key_)

        unique_key_list = []  # stores unique resource types (RRID-prefix)

        for rrid_key in rrid_key_list:
            if ':' in rrid_key:
                unique_key = rrid_key.split(':',1)[1].strip()

                if unique_key not in unique_key_list:
                    unique_key_list.append(unique_key)
            else:
                if rrid_key not in unique_key_list:
                    unique_key_list.append(rrid_key)
    return unique_key_list
In [18]:
uniqe_rrid_list = unique_rrids(all_rrid_list)

print(f'We have {len(uniqe_rrid_list)} distinct types of Resources! \n\nThese are:')
uniqe_rrid_list
We have 16 distinct types of Resources!

These are:
Out[18]:
['AB',
 'CVCL',
 'SCR',
 'MMRRC',
 'RGD',
 'BDSC',
 'OMICS',
 'MGI',
 'DGGR',
 'NXR',
 'NSRRC',
 'IMSR',
 'EXRC',
 'TSC',
 'ADDGENE',
 'AGSC']

Note: Some types of resources are not found on the Sci-Crunch. These are:

  • MMRRC - Organism Mouse
  • DGGR - Organism Fly
  • TSC - Organism Tetrahymena
  • AGSC - Organism Ambystoma

3.3.5. Share of research resource types in  Journals (Pie Chart)

From 152,598 research resource citations throughout 660 Journals:

  • 69% of the resources are antibodies.
  • 19.6% are software resources.

Steps to pie-chart:

  • create Journal- RRID_list mapping.
  • Count all resource types in all 152,598 RRID citation instances.
  • Visualize the share of top resource types. 
In [19]:
# create Journal- RRID_list mapping

with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_3.Journal_xml_RRID_final_Result.json', 'r') as file:
    RRIDs_final_Result =sj.load(file)

# create Journal_RRID mapping from Journal_XML_RRID mapping

Journal_RRID_dict = {}
for journal in list(RRIDs_final_Result):

    # unnest inner list
    Journal_RRID_dict[journal] = sum(list(RRIDs_final_Result.get(journal).values()), [])

Extract resource prefix from all 152,598 RRIDs

In [20]:
# initialize a list that stores all instances of pre-fixes (resource type)
all_resource_types = []

for j in list(Journal_RRID_dict):
    resource_type_list = []
    for str_ in Journal_RRID_dict.get(j):

        if 'RRID' in str_: # if the resource citation has RRID part

            # for RRID:IMSR_JAX:000664 format of resource
            if 'IMSR_JAX'in str_ :

                # separte the RRID-Part
                imsr_jax_num = str_.split(":", 1)[1].strip()

                # separate the IMSR Jax
                imsr_jax = imsr_jax_num.split(":", 1)[0].strip().upper()
                resource_type_list.append(imsr_jax)

            # for RRID:Addgene_000664 format of resource
            elif 'Addgene' in str_:

                # separte the RRID-Part
                addgene_num = str_.split(':', 1)[1].strip()

                # separte the addgene
                addgene = addgene_num.split('_', 1)[0].strip().upper()
                resource_type_list.append(addgene)
            else:
                # for RRID:prefix_000664 format of resource

                # separte the RRID-Part
                prefix_num = str_.split(':', 1)[1].strip()

                # separte the prefix
                prefix = prefix_num.split('_', 1)[0].strip().upper()
                resource_type_list.append(prefix)

        elif 'RRID' not in str_: # if the resource citation so not have RRID part

            # for IMSR_JAX:000664 format of resource
            if 'IMSR_JAX'in str_ :

                # separate the IMSR Jax
                imsr_jax = str_.split(":", 1)[0].strip().upper()
                resource_type_list.append(imsr_jax)

            # for Addgene_000664 format of resource    
            elif 'Addgene' in str_:

                # separte the addgene
                addgene = str_.split('_', 1)[0].strip().upper()
                resource_type_list.append(addgene)
            else:
                # for all cases prefix_1232548

                # separte the prefix
                prefix = str_.split('_', 1)[0].strip().upper()
                resource_type_list.append(prefix)
    all_resource_types.extend(resource_type_list)

Visualize share of top 6 resource types in all journals

In [21]:
# count resource types

import numpy as np

# count each type of resource
rsource_typ_labels, rsource_typ_counts = np.unique(all_resource_types, return_counts=True)

# store counts of each resource type in a dict
Resource_count_dict = dict(zip(rsource_typ_labels, rsource_typ_counts))

# sort resource counts in descending order
Resource_count_dict_sorted = dict(sorted(Resource_count_dict.items(), key=lambda item: item[1], reverse=True)[:])

# delete exceptions
del Resource_count_dict_sorted['RRID:AB']
del Resource_count_dict_sorted['IMSR']

# add non-significant resource types together

others = ['MMRRC', 'RGD', 'DGGR', 'MGI','EXRC', 'NXR', 'NSRRC',  'TSC', 'AGSC' ]
sum_others_count = 0
for key in others:
    sum_others_count += Resource_count_dict_sorted.get(key)

# take the top 6 significant resource counts 
Top_6_resource_typs = dict(sorted(Resource_count_dict.items(), key=lambda item: item[1], reverse=True)[:6])

# add counts of insignificant resources to the dict
Top_6_resource_typs['Others'] = sum_others_count

label = list(Top_6_resource_typs.keys())
resource_type_counts = list(Top_6_resource_typs.values())
In [22]:
# visualize the shre of resources pie-chart

import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7), subplot_kw=dict(aspect="equal"))

# Data to plot
label = list(Top_6_resource_typs.keys())

explode = (0.1, 0.1, 0.09,1,1,1,1)  # explode 1st slice


plt.pie(resource_type_counts,explode=explode,autopct='%1.1f%%', shadow=False, startangle=180)

# annotate
plt.legend(['AB - Antibody',
            'SCR - Software/Databse',
            'CVCL- Cell Line',
            'BDSC - organism',
            'IMSR - Mouse Strain',
            'ADDGENE - Plasmids',
            'OTHERS - Various Organisms'], loc="best")

ax.set_title("Share of Resources types in all Journals")
plt.axis('equal')
fig.savefig("Plot: 2.Share of Resources.png", dpi = 130)
plt.show()

3.3.6.Top 10 journals by RRID  citation count

In [23]:
# read the csv created above into pandas data frame
Journal_resource_count = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_3.Journal_article_title_resource_count.csv')

# aggregate sum - add all rrid citations in each journal
Journal_resource_count_agg = Journal_resource_count.groupby('Journal').sum()

# calculate the total RRID citation in each Journal 
Journal_resource_count_agg["Total RRID in Journal"] = Journal_resource_count_agg.sum(axis=1)


# sort by by Total RRID Count
Journal_resource_count_agg_sorted = Journal_resource_count_agg.sort_values("Total RRID in Journal", ascending = False)

# take the top10 Journals by RRID Count

Top_10_Journal_resource_count_agg = Journal_resource_count_agg_sorted.head(10)
Top_10_Journal_resource_count_agg
Out[23]:
AB Count SCR Count CVCL count BDSC Count IMSR Count Addgene Count Total RRID in Journal
Journal
eLife 30498 13210 3800 2642 2077 1000 53227
Cell_Rep 22719 2067 405 219 125 145 25680
Cell 5689 46 215 0 29 70 6049
Gigascience 10 5957 26 0 2 0 5995
eNeuro 2750 1114 46 78 219 51 4258
Immunity 2860 350 60 0 35 15 3320
EBioMedicine 2530 208 163 0 26 14 2941
Cell_Stem_Cell 2094 486 11 0 0 0 2591
J_Endocr_Soc 2262 28 32 0 12 0 2334
PLoS_One 1595 94 39 14 21 24 1787
In [24]:
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

fig, ax = plt.subplots(figsize=(14, 7))

# PLoat Top 10 Journals by RRID citation count 

RRID_Count_ind = Top_10_Journal_resource_count_agg['Total RRID in Journal'].index
RRID_Count_val = Top_10_Journal_resource_count_agg['Total RRID in Journal'].values

ax.bar(RRID_Count_ind, RRID_Count_val)

# annotate the plot
ax.set_xticklabels(RRID_Count_ind, rotation=15)
ax.set_title("Top 10 journals by RRID citation count")
ax.set_ylabel("RRID Count")
ax.set_xlabel("Journals")

fig.savefig("Plot: 3.Top 10 Journals by RRID Count.png", dpi = 130)
plt.show()

3.3.7. Share of research resource types in top 10 journals

On the above graph, we see the total counts of RRID citations in the top 10 journals that have the most RRID citation counts. One might be interested to know the share of each type of resource type in each of the above 10 journals. The results are shown on the bar graph below.

In [25]:
#Using the data frame Top_10_Journal_resource_count_agg

import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

fig, ax = plt.subplots(figsize=(14, 7))

# PLoat AB Antibody 

AB_ind = Top_10_Journal_resource_count_agg['AB Count'].index
AB_val = Top_10_Journal_resource_count_agg['AB Count'].values

ax.bar(AB_ind, AB_val , label="AB-Antibody")

# Plot Software /DB

SCR_ind = Top_10_Journal_resource_count_agg['SCR Count'].index
SCR_val = Top_10_Journal_resource_count_agg['SCR Count'].values

ax.bar(SCR_ind, SCR_val,bottom= AB_val, label="SCR-Software/Database" )

# plot CVCL Count

CVCL_ind = Top_10_Journal_resource_count_agg['CVCL count'].index
CVCL_val = Top_10_Journal_resource_count_agg['CVCL count'].values

ax.bar(CVCL_ind, CVCL_val,bottom= AB_val +SCR_val , label="CVCL-Cell Line" )

# BDSC Count

BDSC_ind = Top_10_Journal_resource_count_agg['BDSC Count'].index
BDSC_val = Top_10_Journal_resource_count_agg['BDSC Count'].values

ax.bar(BDSC_ind, BDSC_val,bottom= AB_val +SCR_val+CVCL_val , label="BDSC-Organism" )

# IMSR Count

IMSR_ind = Top_10_Journal_resource_count_agg['IMSR Count'].index
IMSR_val = Top_10_Journal_resource_count_agg['IMSR Count'].values

ax.bar(IMSR_ind, IMSR_val,bottom= AB_val +SCR_val+CVCL_val+BDSC_val , label="IMSR-Mouse Strain" )

# Addgene Count

Add_ind = Top_10_Journal_resource_count_agg['Addgene Count'].index
Add_val = Top_10_Journal_resource_count_agg['Addgene Count'].values

ax.bar(Add_ind, Add_val,bottom= AB_val +SCR_val+CVCL_val+BDSC_val+IMSR_val , label="Addgene-Plasmids" )

# annotate the plot
ax.set_xticklabels(Top_10_Journal_resource_count_agg['AB Count'].index, rotation=15)
ax.set_title("Share of Resources in top 10 journals")
ax.set_ylabel("RRID Count")
ax.set_xlabel("Journals")
ax.legend()
fig.savefig("Plot: 4.Share of Resources in top 10 journals.png", dpi = 130)
plt.show()

3.3.8. Average research resource type  citation in all Journals

The above result shows the share of each resource type only in the top 10 journals that have the most RRID count. Here the average RRID citation of each type of research resource type in all 660 Journals have been calculated as follows:

  • Calculate the sum of each type of resource in all Journals
  • Find the average ( mean) of each type of resource for all JOurnals
  • Plot the bar chart
In [26]:
# calculate sum of each type of resource
Journal_resource_agg = Journal_resource_count.groupby('Journal').sum()

import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

fig, ax = plt.subplots(figsize=(14, 7))

# Average AB - antibody citation in all Journals
AB_Journal_resource_agg_avg = Journal_resource_agg[["AB Count"]].mean()
ax.bar("Antibody", AB_Journal_resource_agg_avg)

# Average software - SCR citation in all Journals
SCR_Journal_resource_agg_avg = Journal_resource_agg[["SCR Count"]].mean()
ax.bar("Software/DB", SCR_Journal_resource_agg_avg)

# Average Cell Line - CVCL citation in all Journals
CVCL_Journal_resource_agg_avg = Journal_resource_agg[["CVCL count"]].mean()
ax.bar("CVCL", CVCL_Journal_resource_agg_avg)

# Average BDSC citation in all Journals
BDSC_Journal_resource_agg_avg = Journal_resource_agg[["BDSC Count"]].mean()
ax.bar("BDSC", BDSC_Journal_resource_agg_avg)

# Average IMSR citation in all Journals
IMSR_Journal_resource_agg_avg = Journal_resource_agg[["IMSR Count"]].mean()
ax.bar("IMSR", IMSR_Journal_resource_agg_avg)

# Average Addgene citation in all Journals
Add_Journal_resource_agg_avg = Journal_resource_agg[["Addgene Count"]].mean()
ax.bar("Addgene", Add_Journal_resource_agg_avg)


ax.set_title("Avg research resource type citation in all Journals ")
ax.set_ylabel("Average RRID ")
ax.set_xlabel("Popular research resources ")

fig.savefig("Plot: 5.Avg research resource type citation in all Journals.png", dpi = 130)
plt.show()

3.3.9. Top 10 Journals by RRID Per capita

Since the number of publications is not the same in all journals, RRID per capita is used to measure the density of RRID citations in each Journal.

RRID_Percapita = Number of RRID Citations in a Journal / Total number of XML publications in a Journal

In [27]:
# read files created before 
with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_1.Journal_xml_mapping_dict.json', "r") as read_file:
    Journal_xml_mapping = json.load(read_file)

with open('/home/bk378/t20201023-RRID_analysis-BK/Files/jsons/_7b.Journal_RRID_count.json', "r") as read_file:
    Journal_RRID_count = json.load(read_file)

# Calculate the RRID Percapita

rrid_percap_dict = {}
for vals in list(Journal_xml_mapping):

    if vals in list(Journal_RRID_count):
        rrid_percap_dict[vals] = round((Journal_RRID_count.get(vals) /len(list(Journal_xml_mapping.get(vals)))),2)

# Top 10 RRID Percapita Journals
Top_10_Journals_RRID_parcapita = dict(sorted(rrid_percap_dict.items(), key=lambda item: item[1], reverse=True)[:10])
In [28]:
import matplotlib.pyplot as plt

# style of plot
plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14, 7))

x = list(Top_10_Journals_RRID_parcapita.keys())
y = list(Top_10_Journals_RRID_parcapita.values())

ax.barh(x, y, align='center')

# annotate the plot

ax.set_yticklabels(list(Top_10_Journals_RRID_parcapita.keys()), )
ax.set_ylabel("Journals")
ax.set_xlabel("RRID percapita")
ax.set_title("Top 10  Journals by RRID Percapita")
ax.invert_yaxis()
fig.savefig("Plot: 6.Top 10 Journals by RRID Percapita.png", dpi = 130)
plt.show()
  • Immunity journal has the highest RRID per capita, has more RRID citations in each of its publications on average.

3.3.10. Top Journals by Software resource citation

  • e-Life and gigascience journals use software research resources more than any other journal in PubMed.
In [29]:
# read the csv created above into pandas data frame
Journal_resource_count = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_3.Journal_article_title_resource_count.csv')

# aggregate sum - add all rrid citations in each journal
Journal_resource_count_agg = Journal_resource_count.groupby('Journal').sum()


# sort by by Total software resource count
Journal_resource_count_agg_sorted = Journal_resource_count_agg.sort_values("SCR Count", ascending = False)
Journal_resource_count_agg_sorted

# take the top10 Journals by software Count
Top_10_Journal_SCR_count_agg = Journal_resource_count_agg_sorted[["SCR Count"]].head(10)

#Using the data frame Top_10_Journal_resource_count_agg
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

fig, ax = plt.subplots(figsize=(14, 7))

# PLoat AB Antibody 

SCR_ind = Top_10_Journal_SCR_count_agg['SCR Count'].index
SCR_val = Top_10_Journal_SCR_count_agg['SCR Count'].values
ax.barh(SCR_ind, SCR_val , label="SCR Count")

# annotate the plot
ax.set_yticklabels(SCR_ind)


# annotate 
ax.set_title("Top 10 Journals by Software citation Count")
ax.set_xlabel("RRID Counts")
ax.set_ylabel("Journals")
ax.invert_yaxis()

fig.savefig("Plot: 7.Top 10 Journals by Software Resource Count.png", dpi = 130)
plt.show()

Top Journals by distinct Software Count

A given resource might be cited multiple times in a given journal. A single instance of software resource citations has been counted in each journal and plotted on the bar graph below.

In [30]:
Journal_unique_scr_count = {}
for journal_ in Journal_RRID_dict:
    scr_list = []
    for rrid_ in Journal_RRID_dict.get(journal_):
        if 'SCR' in rrid_:
            if rrid_ not in scr_list:
                scr_list.append(rrid_)
                #print(journal_,rrid_)
    Journal_unique_scr_count[journal_] = len(scr_list)


# take the top 10 journals by unique software resource citation count    
Top10_Journals_by_unique_scr_count = dict(sorted(Journal_unique_scr_count.items(), key=lambda item: item[1], reverse=True)[:10])

import matplotlib.pyplot as plt

plt.style.use("bmh")
fig, ax = plt.subplots(figsize=(14,7))

ax.barh(list(Top10_Journals_by_unique_scr_count.keys()), list(Top10_Journals_by_unique_scr_count.values()))

# annotate
ax.set_title("Top 10 Journals by distinct Software Resource Count")
ax.set_xlabel("RRID Counts")
ax.set_ylabel("Journals")
ax.invert_yaxis()

fig.savefig("Plot: 8 Top 10 Journals by distinct Software Resource Count.png", dpi = 130)
plt.show()

3.3.11. Is RRID citation practice becoming popular?

In [31]:
# read data from csv using panda df

import pandas as pd

RRID_citations_by_publication_date = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_1.Journal_xml_rriCount_ePubDate.csv')
Software_citations_by_publication_date = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_2.Journal_xml_SCR_RRID_with_article_title.csv')

# set index to date time
RRID_citations_by_publication_date= RRID_citations_by_publication_date.set_index('epub_Date')
Software_citations_by_publication_date = Software_citations_by_publication_date.set_index('epub_Date')

# chnange the index to date time object
RRID_citations_by_publication_date.index = pd.to_datetime(RRID_citations_by_publication_date.index)
Software_citations_by_publication_date.index = pd.to_datetime(Software_citations_by_publication_date.index)

#Sort by Date
RRID_citations_by_publication_date = RRID_citations_by_publication_date.sort_index(ascending = True)
Software_citations_by_publication_date = Software_citations_by_publication_date.sort_index(ascending = True)

Software_citations_by_publication_date.head()
#Software_citations_by_publication_date
Out[31]:
Journal xml_file_name article_title RRID mentions RRID Count
epub_Date
2012-05-04 PLoS_One PMC3320645.nxml Molecular Dynamics Simulation Study and Hybrid... ['SCR_00085'] 1
2013-04-25 PLoS_One PMC3636146.nxml Molecular Modeling Study for Inhibition Mechan... ['SCR_00479'] 1
2014-08-26 BMC_Fam_Pract PMC4254388.nxml Prostate-specific antigen (PSA) screening and ... ['SCR_5250', 'SCR_5250'] 2
2014-11-01 J_Air_Transp_Manag PMC7148841.nxml The design of light-handed regulation of airpo... ['SCR_4153', 'SCR_4153'] 2
2014-11-01 J_Air_Transp_Manag PMC7148841.nxml The design of light-handed regulation of airpo... ['SCR_4153', 'SCR_4153'] 2
In [33]:
#resample('M')  -- Monthly average
#resample('Y')  -- Annual average

# sample yearly Y
Yearly_average_RRID_citations = RRID_citations_by_publication_date.resample('M')[["RRID Count"]].mean().fillna(0)
Yearly_average_software_citations = Software_citations_by_publication_date.resample('M')[["RRID Count"]].mean().fillna(0)

# plotting our data
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

# fig size
fig, ax = plt.subplots(figsize=(14,7))

# PLOT RRID citation counts in all Journals

dates = Yearly_average_RRID_citations.index
rrids = Yearly_average_RRID_citations.values
ax.plot(dates,rrids , alpha=0.9, label="all RRIDs (including Software)")


# PLOT software citation counts in all Journals

scr_rrids = Yearly_average_software_citations.values
scr_dates = Yearly_average_software_citations.index
ax.plot(scr_dates, scr_rrids, color='r', alpha=0.9, label="software")

# annotate the plot
ax.legend(loc="upper left")
ax.set_xlabel('Time (Years)')
ax.set_ylabel('Mothly Average RRID count ')
ax.set_title("RRID Vs Software Monthly citation all Journals ")

fig.savefig("Plot: 9M.RRID Vs Software Monthly average citation all Journals.png", dpi = 250)
plt.show()

Use of RRIDs over the years in PLoS One Journal

In [34]:
RRID_citations_by_publication_date_PLOS =RRID_citations_by_publication_date[RRID_citations_by_publication_date['Journal'] == 'PLoS_One']
Software_citations_by_publication_date_PLOS = Software_citations_by_publication_date[Software_citations_by_publication_date['Journal'] == 'PLoS_One']


Yearly_average_RRID_citations_plos = RRID_citations_by_publication_date_PLOS.resample('Y')[["RRID Count"]].mean().fillna(0)
Yearly_average_software_citations_plos = Software_citations_by_publication_date_PLOS.resample('Y')[["RRID Count"]].mean().fillna(0)


# plotting our data
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

# fig size
fig, ax = plt.subplots(figsize=(14,7))

# PLOT RRID citation counts in PLoS One

dates = Yearly_average_RRID_citations_plos.index
rrids = Yearly_average_RRID_citations_plos.values
ax.plot(dates,rrids , marker="o", alpha=0.5, label="all RRIDs (including Software)")


# PLOT software citation counts in PLoS One

scr_rrids = Yearly_average_software_citations_plos.values
scr_dates = Yearly_average_software_citations_plos.index
ax.plot(scr_dates, scr_rrids, marker="o", color='r', alpha=0.5, label="software")

# annotate the plot
ax.legend(loc="upper right")
ax.set_xlabel('Time (Years)')
ax.set_ylabel('Annual Average RRID count ')
ax.set_title("Use of RRIDs over the years in PLoS One Journal")

fig.savefig("Plot: 10.RRID vs Software annual average citation in PLoS One.png", dpi = 250)
plt.show()

Use of RRIDs over the years in eLife Journal

In [35]:
RRID_citations_by_publication_date_eLife =RRID_citations_by_publication_date[RRID_citations_by_publication_date['Journal'] == 'eLife']
Software_citations_by_publication_date_eLife = Software_citations_by_publication_date[Software_citations_by_publication_date['Journal'] == 'eLife']

Yearly_average_RRID_citations_eLife = RRID_citations_by_publication_date_eLife.resample('M')[["RRID Count"]].mean().fillna(0)
Yearly_average_software_citations_eLife = Software_citations_by_publication_date_eLife.resample('M')[["RRID Count"]].mean().fillna(0)

# plotting our data
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

# fig size
fig, ax = plt.subplots(figsize=(14,7))

# PLOT RRID citation counts in eLife

dates = Yearly_average_RRID_citations_eLife.index
rrids = Yearly_average_RRID_citations_eLife.values
ax.plot(dates,rrids, alpha=0.9, label="all RRIDs (including Software)")


# PLOT software citation counts in eLife

scr_rrids = Yearly_average_software_citations_eLife.values
scr_dates = Yearly_average_software_citations_eLife.index
ax.plot(scr_dates, scr_rrids, color='r', alpha=0.9, label="software")

# annotate the plot
ax.legend(loc="upper right")
ax.set_xlabel('Time (Years)')
ax.set_ylabel('Monthly Average RRID count ')
ax.set_title("RRIDs over the years in eLife Journal")

fig.savefig("Plot: 11M.RRID vs Software Monthly average citation in eLife.png", dpi = 250)
plt.show()

3.3.12. What is the first software cited by RRID ?

In [36]:
# read data from csv using panda df

import pandas as pd

SCR_citations_by_publication_date = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_2.Journal_xml_SCR_RRID_with_article_title.csv')

# set index to date time
SCR_citations_by_publication_date= SCR_citations_by_publication_date.set_index('epub_Date')

# chnange the index to date time object
SCR_citations_by_publication_date.index = pd.to_datetime(SCR_citations_by_publication_date.index)

#Sort by Date
SCR_citations_by_publication_date = SCR_citations_by_publication_date.sort_index(ascending = True)

SCR_citations_by_publication_date.head(6)
#Software_citations_by_date
Out[36]:
Journal xml_file_name article_title RRID mentions RRID Count
epub_Date
2012-05-04 PLoS_One PMC3320645.nxml Molecular Dynamics Simulation Study and Hybrid... ['SCR_00085'] 1
2013-04-25 PLoS_One PMC3636146.nxml Molecular Modeling Study for Inhibition Mechan... ['SCR_00479'] 1
2014-08-26 BMC_Fam_Pract PMC4254388.nxml Prostate-specific antigen (PSA) screening and ... ['SCR_5250', 'SCR_5250'] 2
2014-11-01 J_Air_Transp_Manag PMC7148841.nxml The design of light-handed regulation of airpo... ['SCR_4153', 'SCR_4153'] 2
2014-11-01 J_Air_Transp_Manag PMC7148841.nxml The design of light-handed regulation of airpo... ['SCR_4153', 'SCR_4153'] 2
2015-05-06 F1000Res PMC4648211.nxml The Resource Identification Initiative: A cult... ['RRID:SCR_013827', 'RRID:SCR_013827'] 2

The first software resource cited using RRID was RRID:SCR_013827, a Statistics Calculator. Interestingly, the publication was published by the Resource Identification Initiative (RII), the same initiative that introduced RRIDs in 2014. The publication can be found here. The other results prior to 2014, SCR_00085 and SCR_00479 are not RRIDs, but they refer to something known as a Ligand.

Counting of all unique instances of RRIDs from a list of all 152,598 RRIDs has been done. 

In [37]:
# count all unique RRIDs from all_rrid_list

import numpy as np

unique_labels, unique_counts = np.unique(all_rrid_list, return_counts=True)
Resource_count_dict = dict(zip(unique_labels, unique_counts))

# take the top 10 resources which has the most frequency of citation
Top10_Resource_count_dict = dict(sorted(Resource_count_dict.items(), key=lambda item: item[1], reverse=True)[:10])

# visualize popular software
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(max_font_size=400, width=1078,
                  height=720,
                  background_color="black",
                  colormap="hsv").generate_from_frequencies(Resource_count_dict)

plt.figure(figsize=(12,14))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')

cloud.to_file('Plot: 15.Popular software.png')
plt.show()

The most popular software cited using RRIDs are:

  • GraphPad Prism SCR_002798) Statistical analysis software.

  • ImageJ (SCR_003070) : Open source Java based image processing software.

  • MATLAB (SCR_001622) : Multi paradigm numerical computing environment.

  • Fiji (SCR_002285) : Software package as distribution of ImageJ.

  • R Project for Statistical Computing (SCR_001905) : Software environment and programming language.
In [38]:
Popular_software = pd.read_csv('/home/bk378/t20201023-RRID_analysis-BK/Files/CSVs/_4.Popular softwares citation count in journals.csv')

# set index to date time
Popular_software= Popular_software.set_index('epub_Date')

# chnange the index to date time object
Popular_software.index = pd.to_datetime(Popular_software.index)

#Sort by Date
Popular_software_sorted = Popular_software.sort_index(ascending = True)
Popular_software_sorted.head()
Out[38]:
Journal xml_file_name article_title Prism ImageJ Matlab Fiji R
epub_Date
2016-01-02 Brain_Behav PMC4831416.nxml Effects of progesterone on hyperoxia‐induced d... 1 0 0 0 0
2016-01-06 J_Comp_Neurol PMC5116274.nxml Extraocular motoneuron pools develop along a d... 0 0 0 2 0
2016-01-08 Brain_Behav PMC5064343.nxml The feasibility of 11C‐PIB‐PET/CT for amyloid ... 0 0 0 0 1
2016-01-11 eNeuro PMC5089537.nxml Striatal Activity and Reward Relativity: Neura... 0 0 1 0 1
2016-01-12 eLife PMC5167519.nxml The evolutionary origin of bilaterian smooth a... 0 2 0 0 0
In [39]:
# Calculate the average citations
avg_Prism = Popular_software_sorted.resample('Y')[["Prism"]].mean().fillna(0)
avg_ImageJ = Popular_software_sorted.resample('Y')[["ImageJ"]].mean().fillna(0)
avg_Matlab = Popular_software_sorted.resample('Y')[["Matlab"]].mean().fillna(0)
avg_Fiji = Popular_software_sorted.resample('Y')[["Fiji"]].mean().fillna(0)
avg_R = Popular_software_sorted.resample('Y')[["R"]].mean().fillna(0)

# plotting our data
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

# fig size
fig, ax = plt.subplots(figsize=(14,7))

# plot Prism software citation over years

Prism_dates = avg_Prism.index
Prism_rrids = avg_Prism.values
ax.plot(Prism_dates,Prism_rrids , alpha=1, label="1. Prism")

# plot ImageJ software citation over years

ImageJ_dates = avg_ImageJ.index
ImageJ_rrids = avg_ImageJ.values
ax.plot(ImageJ_dates,ImageJ_rrids , alpha=1, label="2. ImageJ ")

# plot Annual_avg_Matlab software citation over years

Matlab_dates = avg_Matlab.index
Matlab_rrids = avg_Matlab.values
ax.plot(Matlab_dates,Matlab_rrids , alpha=1, label="3. Matlab ")

# plot Fiji software citation over years

Fiji_dates = avg_Fiji.index
Fiji_rrids = avg_Fiji.values
ax.plot(Fiji_dates,Fiji_rrids , alpha=1, label="4. Fiji ")

# plot R software citation over years

R_dates = avg_R.index
R_rrids = avg_R.values
ax.plot(R_dates,R_rrids , alpha=1, label="5. R ")

# annotate the plot

ax.legend(loc="upper right")
ax.set_xlabel('Time (Years)')
ax.set_ylabel(' avg.  citation count ')
ax.set_title("Avg. citation of popular software over years ")
ax.set_yscale('linear')

fig.savefig("Plot: 12.citation of popular software over years.png", dpi = 250)
plt.show()
In [40]:
# group by Journal total software citations
Popular_software_sorted_agg = Popular_software_sorted.groupby('Journal').sum()
Popular_software_sorted_agg
Out[40]:
Prism ImageJ Matlab Fiji R
Journal
ASN_Neuro 8 4 0 0 0
Acta_Neuropathol_Commun 2 3 0 0 1
Aging_(Albany_NY) 1 0 0 0 0
Alzheimers_Dement_(Amst) 0 0 0 0 2
Am_J_Physiol_Renal_Physiol 0 1 0 0 0
... ... ... ... ... ...
Traffic 1 0 1 0 0
Transl_Vis_Sci_Technol 0 2 0 0 0
Wellcome_Open_Res 5 1 0 2 0
eLife 865 644 505 552 317
eNeuro 75 56 105 34 29

167 rows × 5 columns

In [41]:
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

fig, ax = plt.subplots(figsize=(14, 7))

# Average Prism citation in all Journals
Prism_software_sorted_agg = Popular_software_sorted_agg[["Prism"]].mean()
ax.bar("Prism", Prism_software_sorted_agg)

# Average ImageJ citation in all Journals
ImageJ_software_sorted_agg = Popular_software_sorted_agg[["ImageJ"]].mean()
ax.bar("ImageJ", ImageJ_software_sorted_agg)

# Average Matlab citation in all Journals
Matlab_software_sorted_agg = Popular_software_sorted_agg[["Matlab"]].mean()
ax.bar("Matlab", Matlab_software_sorted_agg)

# Average Fiji citation in all Journals
Fiji_software_sorted_agg = Popular_software_sorted_agg[["Fiji"]].mean()
ax.bar("Fiji", Fiji_software_sorted_agg)

# Average R citation in all Journals
R_software_sorted_agg = Popular_software_sorted_agg[["R"]].mean()
ax.bar("R", R_software_sorted_agg)

ax.set_title("Average citation count of popular software in Journals")
ax.set_ylabel("Average RRID ")

fig.savefig("Plot: 13.Average citation count of popular software in Journals.png", dpi = 250)
plt.show()
In [42]:
Popular_software_sorted_agg_sorted = Popular_software_sorted_agg.sort_values(['Prism','ImageJ','Matlab', 'Fiji', 'R'] , ascending = False)

#TOP 10 journals by popular software citation count
top10_Journals_by_Popular_software_citation_count = Popular_software_sorted_agg_sorted.head(10)
top10_Journals_by_Popular_software_citation_count
Out[42]:
Prism ImageJ Matlab Fiji R
Journal
eLife 865 644 505 552 317
Cell_Rep 215 90 32 100 42
Curr_Biol 82 82 0 82 0
eNeuro 75 56 105 34 29
Br_J_Pharmacol 61 30 3 2 1
Cell 54 0 0 0 0
Immunity 26 0 0 0 0
Brain_Behav 22 9 27 2 14
Front_Mol_Neurosci 20 13 6 2 0
Cell_Rep_Med 20 0 0 0 4
In [43]:
import matplotlib.pyplot as plt

# plot style
plt.style.use("bmh")

fig, ax = plt.subplots(figsize=(14,7))

# PLoat Prism software 
Prism_ind = top10_Journals_by_Popular_software_citation_count['Prism'].index
Prism_val = top10_Journals_by_Popular_software_citation_count['Prism'].values
ax.bar(Prism_ind, Prism_val , label="Prism")

# Plot ImageJ software
ImageJ_ind = top10_Journals_by_Popular_software_citation_count['ImageJ'].index
ImageJ_val = top10_Journals_by_Popular_software_citation_count['ImageJ'].values
ax.bar(ImageJ_ind, ImageJ_val,bottom= Prism_val, label="ImageJ" )

# plot Matlab Count
Matlab_ind = top10_Journals_by_Popular_software_citation_count['Matlab'].index
MatlabL_val = top10_Journals_by_Popular_software_citation_count['Matlab'].values
ax.bar(Matlab_ind, MatlabL_val,bottom= Prism_val+ImageJ_val , label="Matlab" )

# plot Fiji Count
Fiji_ind = top10_Journals_by_Popular_software_citation_count['Fiji'].index
Fiji_val = top10_Journals_by_Popular_software_citation_count['Fiji'].values
ax.bar(Fiji_ind, Fiji_val,bottom= Prism_val+ImageJ_val+MatlabL_val , label="Fiji" )

# plot R Count
R_ind = top10_Journals_by_Popular_software_citation_count['R'].index
R_val = top10_Journals_by_Popular_software_citation_count['R'].values
ax.bar(Fiji_ind, Fiji_val,bottom= Prism_val+ImageJ_val+MatlabL_val+Fiji_val , label="R" )

# annotate the plot
ax.set_xticklabels(top10_Journals_by_Popular_software_citation_count.index, rotation=15)
ax.set_title("Share of popular software in top 10 journals")
ax.set_ylabel("Software Count")
ax.set_xlabel("Journals")
ax.legend()

fig.savefig("Plot: 14.Share of popular software in top 10 journals.png", dpi = 250)
plt.show()

3.4. Summary  and Future Work



 Summary

The analysis of publications above reveal the following facts:

  • Most researchers used RRID citations in the correct syntax most of the time, even though some degree of inconsistency of RRID use has been observed on few cases. About 5% of RRID citations do not have the RRID-Part of the syntax. Few authors ignored the RRID-Part, introduced unnecessary white space, or made a mistake citing the resources like a typo or wrong key.
  • Only a few journals have adopted RRID citations in PubMed. Out of over 15,259 journals, only 660 have RRID citations. This applies to the number of publications as well, out of over 3.2 million publications only 10,082 publications have RRIDs. This indicates that RRIDs have not become popular yet among many journals.
  • 152,598 RRID citations have been found in 660 PubMed Journals. Most of these are recurring research resource citations of 45,220 distinct types of research resources. 95% of these research resources are found on the Sci-Crunch. This result shows that most of RRID citations worked well to uniquely identify research resources, guaranteeing reproducibility of the result of researchers work. The rest of the cases where RRIDs are missing on the Sci-Crunch (5%) translates to authors mistake while using RRIDs or other keys which are not RRIDs.
  • About 90 % of research resources of the research resources used in the pubMed journals are Antibodies (AB) 69 % and software/Database resources (19.6 %).
  • In PubMed, even if PLoS one is the most popular journal by the number of publications, e-Life journal is the one with the most number of RRID citations.  However, the Immunity journal has the highest RRID per capita, more RRID citations per publication.
  • Among the top 10 journals with most RRID citations, Giga-science journal interestingly uses only software resources, not even one other type of research resource is found in Gigascience journal. In all other journals, fairly all kinds of research resources like antibodies, software, Cell-line, plasmids, etc. have been used in a proportion of:
    • 69.1 % Antibody (AB),
    • 19.6 % software/DB (SCR),
    • 4.5 % Cell Line (CVCL),
    • 2.4 % Organism (BDSC),
    • 2.3 % mouse strain (2.3%),
    • 1.5 % Plasmids(Addgene), and
    • 0.6 % others.
  • The use of RRIDs for citation of research resources has become popular over the past 5 years nearly in all 660 PubMed Journals out of 15,000 journals. This is true for software resources as well.
  • Statistical analysis and image processing software, like GraphPad Prism, ImageJ,Matlab, Fiji and R, have been extensively used by researchers. This makes sense because in life science and biomedical research most of the time the researchers have to analyze images and perform numerical analysis during their experiment.  The other interesting insight is, for some reason, Prism software have became very popular especially since 2019 where as Matlab have became less popular.

 Future Work

In this project, the publications used for analysis of research resources are in the XML file format. When analyzing the use of RRIDs by authors' it was found earlier that RRID-Part from the RRID:prefix_xxxxx is embedded in the XML tag as an XML attribute separated from prefix_xxxx of RRID syntax. However, our regx formula captured only instances of resource citation where the RRID-Part occurs exactly following the RRID:prefix_xxxxx syntax. This has led to a false result indicating that most of the authors did not include RRID-Part in their RRID citation.

These inconsistencies in results have been found by manually comparing few samples of PDF publication files with XML (as shown in 3.  How often do authors use RRIDs intentionally?). In future work, the citation of RRIDs can be compared in XML and PDF files in an automated way. This can be done by extracting attribute information from the XML tag and combining it with the rest of the RRID part to re-construct the actual RRID citation pattern, the way originally the author cited the resource.

The other thing, before creating RRIDs it would be great to perform text mining to check if an RRID resembling key has been already been assigned to another type of non RRID resource. During this project, two papers 1 and 2 have been found with a resource citation SCR_00085 and SCR_00479. These keys resemble an RRID citation without the RRID-Part but they are not actually RRIDs, rather they are Ligands.

By chance, if any software resource, not assigned an RRID key before, is given one of these keys or keys like those, this is going to create a conflict and ambiguity. Therefore for future work, text mining for keys has to be done before assigning a new RRID key to check if the key has already been used by other non-RRID types of resource. This is very critical given the fact that only 660 journals out of more than 15,000 journals have used RRIDs for the citation of research resources so far.

Back to Cover